Cloud computing outage or failure of the customer SW architecture?

July 4, 2012, by fabio.cecaro

I find it amusing to see how people online constantly, and often inappropriately, talk about cloud computing, both positively and negatively. Everyone fires off an opinion while calling themselves an expert on this or that.

As the title suggests, there has been a new “disastrous” event for cloud computing. I feel it is my duty to analyze the case and respond to all those who have only managed to demonize “cloud computing”, as if it were a single thing one could point a finger at.

For those who have forgotten, or who have never had the chance to read an official definition: cloud computing is a paradigm, not a technology, and it can be public or private.

Event

Let’s get to what happened this time, at least as far as we know.

A very strong storm hit the Virginia area of the USA on June 30, causing at least 12 deaths and leaving an unknown number of people injured or homeless. About 2 million people across the area were left without power for a long time, and technicians moved immediately to restore it as quickly as possible, so that the record heat would not add to the death toll with air conditioners out of service.

Virginia hosts Amazon Web Services datacenters. This is the historical AWS region, the first one and also the “default” for the management APIs: a customer who does not change region will always and only use resources in this area. Changing it is simple: add --region with the API tools, call set_region with the PHP SDK, or, perhaps more easily, pick the region in the Management Console. AWS itself is distributed across the globe (US, EU, Asia, Japan, South America) with many datacenters.
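To make the “default region” point concrete, here is a minimal sketch in Python of how a client tool might pick its region. It is an illustration under stated assumptions, not AWS code: the function resolve_region and the CLOUD_REGION environment variable are hypothetical names invented for this example, and the precedence order is assumed. Only the behaviour, falling back to the Virginia region unless the customer says otherwise, comes from the text above.

```python
import os

# "us-east-1" is the identifier of the Virginia region described above.
DEFAULT_REGION = "us-east-1"


def resolve_region(explicit=None, env=None):
    """Return the region a request would be sent to.

    Precedence (an assumption for this sketch): explicit argument,
    then the hypothetical CLOUD_REGION environment variable, then the default.
    """
    env = os.environ if env is None else env
    return explicit or env.get("CLOUD_REGION") or DEFAULT_REGION


if __name__ == "__main__":
    # No choice made: everything lands in the default region.
    print(resolve_region(env={}))
    # An explicit choice, comparable to passing --region on the command line.
    print(resolve_region("eu-west-1", env={}))
```

The takeaway is that a customer who never makes the explicit choice ends up with everything in one place, which is exactly the situation many sites were in when Virginia lost power.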

According to Amazon, one of these datacenters lost power (AWS publishes the status of all its services in all regions).

Many customers had services running in this region and found themselves offline from one moment to the next. The case caused a sensation because this time two big, very fashionable social portals, Instagram and Pinterest (and also Netflix), went down and stayed offline for a few hours.

But I wonder: how many other lesser-known datacenters with old-style hosting suffered the same fate?

It was an extreme event (and, given the way our planet’s climate is going, it probably won’t be the last).

AWS is the first and largest cloud provider, the most widespread and the most used, and it hosts the greatest number of sites and portals with considerable traffic. Its failures are therefore more likely to be noticed by a wide audience; let’s say it is very much under scrutiny.

On April 21, 2011 the same region was compromised by human error, and in my opinion that event exposes the real problem that large infrastructures holding ever-growing amounts of data have to face. On August 7, instead, the Irish datacenters of AWS and Microsoft were compromised by a lightning storm. In that case it seems the DCs were equipped with the same kind of automatic backup power system, which was evidently knocked out by the voltage surge from lightning striking nearby.

Let’s consider the two cases caused by extreme weather, without playing too much on the symbolism of clouds and storms, as many columnists have done.

We are dealing with datacenters that had problems with their backup power systems, so a brief description of how those systems work in a DC is in order. A DC generally receives electrical power from a utility; this goes directly to the UPS (basically battery packs), and from there to the server rooms full of racks that hold the network equipment, servers and storage.

A short GreenIT aside. About 30% of the energy is lost in the conversion from alternating current to direct current (in the UPS) and from direct current back to alternating current, after which each device internally converts the electricity to direct current once more. This is one of the points we raised in our Make The Cloud Green project.
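The arithmetic behind a figure like this can be sketched as follows. The per-stage efficiencies, and the choice of which stages to count, are illustrative assumptions and not measurements from any datacenter; the point is only that losses in a chain of conversions multiply.

```python
from math import prod

# Assumed efficiency of each conversion stage (illustrative values only).
STAGES = {
    "AC -> DC in the UPS": 0.89,
    "DC -> AC out of the UPS": 0.89,
    "AC -> DC inside each device": 0.89,
}


def chain_efficiency(efficiencies):
    """Overall efficiency of conversions applied one after another."""
    return prod(efficiencies)


def loss_fraction(efficiencies):
    """Fraction of the input energy that never reaches the load."""
    return 1.0 - chain_efficiency(efficiencies)


if __name__ == "__main__":
    loss = loss_fraction(STAGES.values())
    print(f"Energy lost across the chain: {loss:.1%}")
```

With these assumed values, three stages that each look fairly efficient on their own already waste a little under 30% of the incoming energy.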

There are therefore fully automatic, reliable systems that stabilize the power supplied to the server rooms: if there is no power at the source, the equipment draws it from the UPS units. Obviously, these are sized to carry the full load for only a few tens of minutes, so every self-respecting DC is also backed by more than one generator set, which starts automatically when the utility stops supplying power. These units, generally diesel engines, usually take only a few minutes to start, so the operation of a DC should be guaranteed for many hours, that is, until the fuel runs out.
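The handover just described (grid, then UPS, then generators) can be checked with a few lines of Python. This is a rough sketch: the class name BackupPlan, its fields and the sample numbers are all hypothetical, and the model deliberately ignores any battery charge left over once the generators are running.

```python
from dataclasses import dataclass


@dataclass
class BackupPlan:
    ups_minutes: float              # how long the UPS carries the full load
    generator_start_minutes: float  # time for the generators to take over
    fuel_hours: float               # generator runtime on the available fuel

    def survives_handover(self) -> bool:
        """True if the UPS lasts until the generators are running."""
        return self.generator_start_minutes <= self.ups_minutes

    def autonomy_hours(self) -> float:
        """Hours the DC keeps running after the grid fails."""
        if not self.survives_handover():
            # The load drops as soon as the batteries are empty.
            return self.ups_minutes / 60.0
        # UPS bridges the gap, then generators run until the fuel is gone.
        return self.generator_start_minutes / 60.0 + self.fuel_hours


if __name__ == "__main__":
    healthy = BackupPlan(ups_minutes=20, generator_start_minutes=3, fuel_hours=24)
    faulty = BackupPlan(ups_minutes=20, generator_start_minutes=30, fuel_hours=24)
    print(healthy.survives_handover(), healthy.autonomy_hours())
    print(faulty.survives_handover(), faulty.autonomy_hours())
```

The second case is the interesting one: a generator that fails to start in time turns many hours of theoretical autonomy into a few minutes, no matter how much fuel is in the tank.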

Another GreenIT aside: there is the case of a bank entirely powered by fuel cells, i.e. off the grid.

In general, all these devices, which may never be called into operation, require periodic monitoring, maintenance and functional tests, which in practice means scheduled simulations of catastrophic events. It is therefore a mistake to talk about a “cloud computing outage”; we should focus instead on the fact that the DCs were poorly designed, or were not properly maintained or tested.

Or maybe AWS should be asked to publish the tests carried out on all the DCs’ supporting equipment, along with their results. Better still, we could create new standards and regulations for DC construction that make datacenters energy-independent from the grid and safer against atmospheric events; see, for example, the datacenter built in Stockholm in a former nuclear bunker. Another GreenIT aside: the heat emitted by the systems could be used to heat adjacent towns, rather than wasting energy to cool the systems and still more to heat the homes (someone already does this).

Customer’s SW Responsibilities

Now let’s get to what the customers who found themselves offline underestimated. AWS provides IaaS-level cloud computing tools, multiple datacenters, and guidance on how to implement your own architectures, with an emphasis on using multiple datacenters for business-critical applications. This means that if an Instagram, which has received millions of dollars of investment, does not know how to design its solution around more than one datacenter, then it is Instagram that should be condemned: not cloud computing itself, and not AWS, for the way Instagram’s SW architecture was designed.
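A back-of-the-envelope calculation shows why relying on more than one datacenter matters. The sketch below assumes that datacenters fail independently, which is optimistic, and uses a made-up availability figure of 99.9% per datacenter; neither number comes from AWS.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Availability of a service that stays up while at least one of
    `copies` datacenters is up, assuming independent failures."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for copies in (1, 2, 3):
        a = combined_availability(0.999, copies)
        print(copies, f"{a:.9f}", f"{downtime_hours_per_year(a):.4f} h/year")
```

Under these assumptions a second datacenter cuts the expected downtime from several hours a year to well under a minute, which is the architectural choice the offline customers did not make.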

AWS provides a number of services for which it guarantees fault tolerance and high availability: S3, SimpleDB, Simple Queue Service, Elastic Load Balancing, Elastic IP and, I would add, Route53. On top of these you can build your own highly available services, as many other, smarter customers have already done.
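The idea behind building on such services can be sketched as a simple failover rule: keep an ordered list of endpoints in different datacenters and send traffic to the first one that passes a health check. This is only an illustration of the principle and says nothing about how Elastic Load Balancing or Route53 are implemented; the function name and the example.com hostnames are hypothetical.

```python
from typing import Callable, Iterable, Optional


def pick_endpoint(endpoints: Iterable[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint that passes its health check, else None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    # Hypothetical endpoints, one per datacenter, in order of preference.
    endpoints = ["app.dc-a.example.com", "app.dc-b.example.com"]
    # Simulate the first datacenter losing power.
    down = {"app.dc-a.example.com"}
    print(pick_endpoint(endpoints, lambda e: e not in down))
```

The health check is passed in as a function so the same rule can be exercised with a simulated outage, as in the usage example, before trusting it with real traffic.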

On the other hand, it would be very unseemly to see a public cloud provider of PaaS, and above all of SaaS, with high availability problems, as happened some time ago to Google’s Gmail, for example.

Finally, let’s enjoy a list of the 10 worst cloud outages; the list has not been updated so far.

Colossal cloud outage No. 1: Amazon Web Services goes poof
Colossal cloud outage No. 2: The Sidekick shutdown
Colossal cloud outage No. 3: Gmail fail
Colossal cloud outage No. 4: Hotmail’s hot mess
Colossal cloud outage No. 5: The Intuit double-down
Colossal cloud outage No. 6: Microsoft’s BPOS oops
Colossal cloud outage No. 7: The Salesforce slipup
Colossal cloud outage No. 8: Terremark’s terrible day
Colossal cloud outage No. 9: The PayPal fall-down
Colossal cloud outage No. 10: Rackspace’s rough year

Now a little nudge toward a standard for building a new generation of off-the-grid datacenters. NASA has been officially warning for months, if not for a couple of years, that solar storms are steadily increasing and will peak between the end of 2012 and the end of 2013, and that they could put our digital infrastructure at serious risk, with many long blackouts.
